
Week 1 lecture notes
Reviewing introductory papers on interpretability and linguistic probes inside the black box of neural language models
The Barest Thought of an Intro to Neural Nets
- A brief recent history of neural networks
- Neural networks are mathematical objects — for doing computation
- The most common types can be boiled down to simple matrix multiplication - the forward pass
- Models can be hard-wired or learned, typically using gradient-based methods like backpropagation
- General framing: The model will try to learn some mapping from the input (e.g., some vector representing the pixels of an image) to an output (i.e., a prediction, such as a single number, or a vector of numbers, such as class probabilities)
- Simplest multi-layer models (e.g., the multilayer perceptron) perform two stages of matrix multiplication, with a nonlinear transformation applied to the intermediate state
- These nonlinear transformations are called “activation functions”, a term that goes back to earlier connectionist modeling papers
- Nonlinearities allow models to learn statistically interesting conjunctions of features; the classic example is the XOR problem (see the sketch after this list)
- These conjunctions of features are interactions in the same sense as in statistics: the effect of one variable depends on the level of another (e.g., the ab term in a + b + ab)
- Linguistic structure is highly interactive — there are usually multiple sources of information that influence how we interpret language
- Five years after Mikolov et al. (2013), the foundational word2vec paper, what was the state of research in NLP? A variety of models: ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), GPT-2 (Radford et al., 2019), and many more since
- Movement from recurrent structures (e.g., RNNs and LSTMs) to attention-based computations using Transformer architectures (e.g., Vaswani et al., 2017)
- Terminological note: I use “RNN” for models whose only mechanism for carrying prior hidden states forward is simple recurrence; LSTMs (which add gating such as forget gates) are a very different architecture; Transformers are trained in much the same way as RNNs and LSTMs but have no recurrence, so predictions for all positions are computed in parallel
- Neural network models have massively grown in size and numbers of parameters
- Big questions about neural networks:
- What is in the input and output of these models? = What encoding representations are we using? What assumptions does using those representations make?
- What can and do the models learn from the data?
- How are they generally trained?
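A minimal sketch of the two-stage forward pass and the XOR point above. The 2-2-1 multilayer perceptron below uses hand-picked weights (an assumption for brevity; in practice the weights would be learned by backpropagation), and the nonlinearity is what makes XOR expressible at all, since no single linear layer can separate these four points.

```python
import numpy as np

# XOR inputs and targets: not linearly separable, so no single
# matrix multiplication X @ w + b can reproduce y.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Hand-picked weights for a 2-2-1 MLP with ReLU (one of many valid solutions).
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])   # input -> hidden
b1 = np.array([0.0, -1.0])
W2 = np.array([1.0, -2.0])    # hidden -> output
b2 = 0.0

def forward(X):
    h = np.maximum(0, X @ W1 + b1)  # first multiplication + nonlinear activation
    return h @ W2 + b2              # second stage of multiplication

print(forward(X))  # [0. 1. 1. 0.] matches y, i.e., XOR of the two inputs
```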
Readings
Alishahi, A., Chrupała, G., & Linzen, T. (2019). Analyzing and interpreting neural networks for NLP: A report on the first BlackboxNLP workshop. Natural Language Engineering, 25(4), 543-557. https://www.cambridge.org/core/journals/natural-language-engineering/article/analyzing-and-interpreting-neural-networks-for-nlp-a-report-on-the-first-blackboxnlp-workshop/FAFF1B645BBF89FE400A521526AA65D4
Notes
- “Octopus paper” - Bender and Koller (2020) (“Climbing towards NLU”)
- Wide variety of reasons to want interpretable models
- Stakeholders in a business
- Accountability for legal reasons (e.g., California or the EU)
- “Black box” —> BlackboxNLP
- Approaches outlined in BlackboxNLP
- Developing annotated and specialized datasets to test models
- Manipulation of the input to neural networks to test for importance of specific linguistic or demographic features
- Developing diagnostic classifiers trained over intermediate representations from within a neural network model
- Modifying neural network architectures to make them more explainable —> Simplify or distill the model into a smaller, simpler one
- Designing training or testing datasets over simplified or formal languages
- Input manipulation
- Punctuation
- Tokenization
- Lemmatization
- Chunking
- Datasets
- Diverse NLI - model must answer logical/semantic questions of varying linguistic complexity
- GLUE - Benchmark suite of natural language understanding tasks spanning different domains
- Human reference points, e.g., children’s behavior in theory-of-mind experiments
- Sentences of varying types of linguistic complexity (e.g., subject-verb agreement tests)
- Developing diagnostic classifiers
- Auxiliary task - Some other task (e.g., sentiment analysis)
- Diagnostic classifiers - Is the presence or absence of a linguistic feature “in” the encoding/embedding/vector representation? (a minimal probe is sketched after these notes)
- Can leverage the predictions of diagnostic classifiers to “nudge” a trained model in a more linguistic direction
- Part-of-speech classifiers (e.g., NOUN, ADJ, VERB, PUNCT)
- Subject-verb agreement (“The key(s) [to the cabinet(s)] is/are on the table”)
- Nearest neighbors with a notion of conformity (Wallace, Feng, & Boyd-Graber, 2018): removing a feature (e.g., a word from a passage) can shift the overall representation and the model’s prediction (see the ablation sketch after these notes)
- Probing
- Decoding
- Modifying neural network architectures
- Simplified or formal languages
- Cross-linguistic transfer between a large corpus and a small corpus to see how original learned representations do/do not get preserved when training on a “new” language
- Formal languages
- Recognizing whether a string is valid in some formal system or not
- Some formal languages require a pushdown automaton, something with a stack that can keep track of previously seen symbols; how well networks cope depends on a complex interaction between activation function (e.g., ReLU) and architecture (GRU, LSTM, or plain recurrent network)
- RNNs and LSTMs perform poorly at recognizing Dyck languages (matching opening and closing brackets) on strings longer than those they were trained on (a recognizer and length-generalization split are sketched after these notes)
- Desirable future links
- Evaluation - “When an explanation matches what a human would see as a reasonable basis of a particular decision, it does not necessarily follow that this was the basis”
- Benchmarks
- Neuroscientific alignment
- Growing area in natural language processing!
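Two of the approaches above are easy to sketch in code. First, input manipulation by leave-one-out ablation (the idea behind removing a word and watching the representation or prediction shift, as in Wallace et al.): the snippet below uses a toy bag-of-words classifier as a stand-in for a real model, and the training sentences are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-in for a trained NLP system; a real study would probe an
# LSTM or Transformer classifier instead of a bag-of-words model.
train_texts = ["great movie , loved it", "terrible plot , hated it",
               "wonderful acting throughout", "boring and awful writing"]
train_labels = [1, 0, 1, 0]  # 1 = positive sentiment, 0 = negative

vec = CountVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(train_texts), train_labels)

def p_positive(text):
    """Probability the (toy) model assigns to the positive class."""
    return clf.predict_proba(vec.transform([text]))[0, 1]

sentence = "great acting but boring plot"
print(f"full input: {p_positive(sentence):.3f}")

# Leave-one-out ablation: drop each token in turn and see how far the
# prediction moves; large shifts flag tokens the model relied on.
tokens = sentence.split()
for i, tok in enumerate(tokens):
    ablated = " ".join(tokens[:i] + tokens[i + 1:])
    print(f"without {tok!r}: {p_positive(ablated):.3f}")
```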
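Second, a diagnostic (probing) classifier: freeze a pretrained encoder, pull out hidden states, and fit a deliberately simple linear classifier on top, so that any success reflects information already present in the representation. The word list, layer choice, and noun/verb task below are toy assumptions; a real probe would use a tagged corpus (and control tasks). The snippet downloads bert-base-uncased on first run.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

# A handful of hand-labelled tokens (word, coarse POS).
data = [("dogs", "NOUN"), ("cats", "NOUN"), ("tables", "NOUN"), ("keys", "NOUN"),
        ("run", "VERB"), ("sleep", "VERB"), ("eat", "VERB"), ("write", "VERB")]

def embed(word, layer=6):
    """Hidden state for a single word at the given encoder layer."""
    with torch.no_grad():
        out = model(**tok(word, return_tensors="pt"))
    return out.hidden_states[layer][0, 1].numpy()  # first wordpiece after [CLS]

X = np.stack([embed(w) for w, _ in data])
y = [pos for _, pos in data]

# The probe itself stays simple (linear), so it can only read out what the
# frozen encoder already represents.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print(probe.predict([embed("chairs"), embed("jump")]))  # hopefully NOUN, VERB
```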
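Finally, the formal-language point: recognizing a Dyck language needs a stack, i.e., a pushdown automaton, and the split at the end mirrors the train-short/test-long setup described above. Uniform random sampling of brackets is only for brevity; real experiments sample from the grammar so that valid strings are well represented.

```python
import random

PAIRS = {")": "(", "]": "["}

def is_valid_dyck(s):
    """Stack-based recognizer: remembers every unmatched opening bracket."""
    stack = []
    for ch in s:
        if ch in "([":
            stack.append(ch)
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack  # valid only if everything opened was closed

print(is_valid_dyck("([()])"))  # True
print(is_valid_dyck("([)]"))    # False: the brackets cross

def sample(n, lo, hi):
    """n random bracket strings with lengths in [lo, hi]."""
    return ["".join(random.choice("()[]") for _ in range(random.randint(lo, hi)))
            for _ in range(n)]

# Train on short strings, test on strictly longer ones.
train = [(s, is_valid_dyck(s)) for s in sample(5, 2, 10)]
test = [(s, is_valid_dyck(s)) for s in sample(5, 20, 30)]
print(train)
print(test)
```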
Rogers, A., Kovaleva, O., & Rumshisky, A. (2020). A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8, 842-866. https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00349/96482/A-Primer-in-BERTology-What-We-Know-About-How-BERT
Notes
- Syntactic knowledge
- Define each of the following:
- Linear versus hierarchical structure (e.g., “The cat the dog is sleeping next to is cute”)
- Part-of-speech information (e.g., NOUN, ADJECTIVE, etc.)
- Syntactic chunks (what sequences go together)
- Roles (e.g., subject, object, arguments, adjuncts)
- Named entity categories (memorization)
- Pragmatic inference
- Event knowledge
- Syntactic relations (e.g., syntactic dependencies)
- Subject-verb agreement (a masked-LM check is sketched after this list)
- Anaphora
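A quick sketch of how the subject-verb agreement point can be checked against BERT directly: mask the verb in “key(s) to the cabinet(s)”-style items and compare the scores the masked language model assigns to “is” versus “are”. This uses Hugging Face’s fill-mask pipeline and downloads bert-base-uncased on first run; the two sentences are illustrative, not a full agreement benchmark.

```python
from transformers import pipeline

# Does BERT prefer the verb form that agrees with the head noun, even with an
# intervening "attractor" noun of the opposite number?
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for sentence in ["The key to the cabinets [MASK] on the table.",
                 "The keys to the cabinet [MASK] on the table."]:
    candidates = unmasker(sentence, targets=["is", "are"])
    scores = {c["token_str"]: round(c["score"], 4) for c in candidates}
    print(sentence, scores)
```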
Madsen, A., Reddy, S., & Chandar, S. (2022). Post-hoc interpretability for neural NLP: A survey. ACM Computing Surveys (Just Accepted, June 2022). https://doi.org/10.1145/3546577
Notes
- Motivations for interpretability
- “incompleteness in the problem formalization”
- Accountability
- Safety
- Ethics
- Scientific understanding
- Communication strategies in the interpretability literature
- Local explanations (single observations)
- Global explanations (the whole model)
- Class explanations (multiple observations from a single class)
- Intrinsic interpretability
- Post-hoc interpretability - methods or models built after an NLP system is trained, used to interpret its behavior
- Measures of interpretability
- Application-grounded - evaluated in the real end task, e.g., do doctors working with the AI and its explanations save more lives than doctors (or AIs!) alone?
- Functionally-grounded - comparing with other post-hoc methods or an intrinsically interpretable model (e.g., a linear model)
- Human-grounded - An estimate of the utility to people in general (vs. researcher intuitions), e.g., the model people choose as the most accurate